Application of high-throughput single-nucleus DNA sequencing in pancreatic cancer

Despite insights gained by bulk DNA sequencing of cancer it remains challenging to resolve the admixture of normal and tumor cells, and/or of distinct tumor subclones; high-throughput single-cell DNA sequencing circumvents these and brings cancer genomic studies to higher resolution. However, its application has been limited to liquid tumors or a small batch of solid tumors, mainly because of the lack of a scalable workflow to process solid tumor samples. Here we optimize a highly automated nuclei extraction workflow that achieves fast and reliable targeted single-nucleus DNA library preparation of 38 samples from 16 pancreatic ductal adenocarcinoma patients, with an average library yield per sample of 2867 single nuclei. We demonstrate that this workflow not only performs well using low cellularity or low tumor purity samples but reveals genomic evolution patterns of pancreatic ductal adenocarcinoma as well.

Ploidy was calculated as follows: (1) Read count per (cell * amplicon) is normalized both across cells and amplicons as described in Methods.
(2) KRAS-mutated group is identified as cells carrying KRAS HOM/HET genotype; all other cells are assigned as putative normal cells.
(3) Putative normal cells are randomly split into 3 equally sized groups norm_a, norm_b, norm_c. Norm_a is used as diploid baseline and all other groups' absolute ploidy is calculated as the ratio of their normalized read counts to norm_a' normalized read counts. For sample PA02-1, the resulting absolute per-amplicon ploidy for each group is plotted above (a-d).
Source data are provided as a Source Data file. c. Digital droplet PCR results on KRAS variants in samples PR02-1 (used instead of PR02-4 because the latter's nuclei material was depleted) and PR02-3's leftover nuclei from Tapestri runs. Source data are provided as a Source Data file.

Supplementary Figure 8: single-cell SNV and CNV results of a KRAS WT PDAC
a-c. Single-cell genotype heatmap of 7 important genomic variants pre-identified by bulk WES (left) and genome-wide per-amplicon ploidy heatmap (right) for samples PR01-1 (a), PR01-2 (b), PR01-3 (c). Each cell's clone identity (middle) is colored as shown by labels. The FGFR1 and TGFBR2 SNV clones are defined as cells with non-WT genotype of each gene. The "putative normal" clone is defined as cells with "WT" genotype of all 7 genetic variants. Cells are hierarchically clustered within each clone.      In microfluidic experiments the doublet rate δ, expected proportion of doublets in the data, depends on the rate at which the cells flow through the nozzle and time-interval in which a droplet is formed.

code subsampled number of reads (million) FASTQ size (R1, Gb) Tapestri pipeline run time (hr) max memory (Gb) max processes max threads % reads mapped to panel number of total barcodes (million) % DNA read pairs assigned to cells number of cells called read depth (mean reads/cell/amplicon) number of KRAS-mutated cells % KRAS-mutated nuclei
When these quantities are small, the probability a doublet occurring can be approximated by a Binomial distribution [1][2][3] with a success probability δ.  We need to solve the following set of nonlinear equations: We solve these nonlinear equations numerically using the open-source package scipy.
The clustering algorithm was chosen because it allows for: • variable number of data states, which enables differentiating among heterozygous, homozygous, and missing SNV genotypes.
• variable layers of data, which enables inclusion of both SNV and CNV events.
without making any assumption of the evolutionary relationships among clusters.

input setup
For the SNV input matrix, we defined 4 states: 1. For an SNV to be considered HET in a cell, it needs to satisfy: (1) alternative allele read count > 0 (2) variant allele frequency (VAF) > 0.2 2. For an SNV to be considered HOM in a cell, in addition to the two requirements for HET above, it needs to satisfy: For an SNV to be considered MISS in a cell, it needs to satisfy: (1) total read depth = 0 3 The thresholds were intentionally set leniently (except for MISS) because we hoped to push the error correction to SCG.
For the CNV input matrix, to avoid the complexity of accurate copy number calling which is not in the scope of this paper, we only included the homozygous deletion status: -0 as ploidy>0; -1 as ploidy=0.
This was based on our assumption that a read count of 0 very likely (99% as defined in the priors) represents a real homozygous deletion and likewise for a read count of nonzero.
Ploidy status was determined by a hard threshold, too: for an amplicon to be ploidy=0, it needs to satisfy that the forward read count strictly equals 0.
To focus on studying SMAD4's CNV evolution, we only included SMAD4's 8 amplicons. He had the hypothesis from observing the raw data (Supplementary Figure 9) that the 8 amplicons underwent a likely unsynchronous, step-wise process to the final state of 8/8 homozygous deletion, assuming that a deleted DNA cannot be regained.
We concatenated 8061 cells from all samples (including 2 technical replicates) of patient PA04 to do the clustering. The input matrices and the run parameters in YAML format are included in Supplementary Data.

model selection
We set the number of clusters to 40, with the belief that clonal selection would be almost complete in such a late-stage PDAC case and the number of observable/computationally solvable clone with respect to the input loci/regions should be less than 40.

hyperparameters
input state emission density (γ): Exactly as set up in the original SCG paper, we assume that data observed for each SNV/CNV is noisy, such that given the true genotype/ploidy state of loci is s, the probability of observing a value t is given by ε s ∈ [0,1], which can be modeled by a Dirichlet distribution with hyperparameters γ s : ε s |γ s ∼Dirichlet(ε s |γ s ) These hyperparameters were set as shown in Supplementary Table 5 below, based on our understanding of the noise profile of the data.
cluster proportions: we set κ = 0.01, which controls the distribution of the cluster proportions.
doublet: We disabled modeling for doublets to improve the computational efficiency, given that the doublet rate had been estimated to be low.  Table 5: Parameter settings for the gamma parameter for SNV and CNV data types. The rows correspond to hidden states and the columns correspond to observed states. Values are pseudo-counts in the Dirichlet distribution for s. Each row thus represents a setting of γ s , the probability of observing t given s.

assessing convergence
After running SCG with 5000 random restarts, we recognize that the model is highly subject to local maxima. See below (Supplementary Figure 12) for the comparison of the evidence lower bound (ELBO) plots of 2000 and 5000 random restarts. The max ELBO improved very minimally and did not seem to be converging. Therefore, we did not move upwards from 5000 restarts.

building a phylogeny with inferred clusters
With the clustering returning only 7 distinct clusters for PA04, we thought inferring a clone tree was trivial: we defined a "genomic evolution cost" matrix for every SNV and CNV state transitions between each pair of unique clusters (Supplementary Table 6), and inferred the minimum spanning arborescence using a modified Edmond's algorithm implemented in the NetworkX Python package [4].